perturbation text
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Chen, Cong, Liu, Mingyu, Jing, Chenchen, Zhou, Yizhou, Rao, Fengyun, Chen, Hao, Zhang, Bo, Shen, Chunhua
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs) particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures the caption quality in concept level. We hereby introduce HalFscore, a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
- Asia > China (0.14)
- North America > United States > Texas (0.14)
- Leisure & Entertainment > Sports > Soccer (1.00)
- Leisure & Entertainment > Sports > Tennis (0.93)
- Transportation (0.70)
Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators
Chang, Jiayi, Gao, Mingqi, Hu, Xinyu, Wan, Xiaojun
Previous research has shown that LLMs have potential in multilingual NLG evaluation tasks. However, existing research has not fully explored the differences in the evaluation capabilities of LLMs across different languages. To this end, this study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs, spanning high-resource and low-resource languages through correlation analysis, perturbation attacks, and fine-tuning. We found that 1) excluding the reference answer from the prompt and using large-parameter LLM-based evaluators leads to better performance across various languages; 2) most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages; 3) in the languages where they are most sensitive to such attacks, they also tend to exhibit the highest correlation with human judgments; and 4) fine-tuning with data from a particular language yields a broadly consistent enhancement in the model's evaluation performance across diverse languages. Our findings highlight the imbalance in LLMs'evaluation capabilities across different languages and suggest that low-resource language scenarios deserve more attention.
- Research Report > Experimental Study (0.48)
- Research Report > New Finding (0.34)
Mapping the Mind of an Instruction-based Image Editing using SMILE
Dehghani, Zeinab, Aslansefat, Koorosh, Khan, Adil, Rivera, Adín Ramírez, George, Franky, Khalid, Muhammad
Despite recent advancements in Instruct-based Image Editing models for generating high-quality images, they are known as black boxes and a significant barrier to transparency and user trust. To solve this issue, we introduce SMILE (Statistical Model-agnostic Interpretability with Local Explanations), a novel model-agnostic for localized interpretability that provides a visual heatmap to clarify the textual elements' influence on image-generating models. We applied our method to various Instruction-based Image Editing models like Pix2Pix, Image2Image-turbo and Diffusers-Inpaint and showed how our model can improve interpretability and reliability. Also, we use stability, accuracy, fidelity, and consistency metrics to evaluate our method. These findings indicate the exciting potential of model-agnostic interpretability for reliability and trustworthiness in critical applications such as healthcare and autonomous driving while encouraging additional investigation into the significance of interpretability in enhancing dependable image editing models.
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
- Europe > United Kingdom > England > East Yorkshire > Hull (0.04)
- Media > Photography (1.00)
- Information Technology (0.88)
- Health & Medicine > Diagnostic Medicine (0.67)
- Transportation > Ground > Road (0.66)